Detecting Co-Derivative Documents in Large Text Collections
Abstract
We have analyzed the SPEX algorithm of Bernstein and Zobel [1], which detects co-derivative documents using duplicate n-grams. Although we agree with the claim that ignoring unique n-grams can greatly increase the efficiency and scalability of co-derivative detection, we found serious bottlenecks in the way SPEX finds the duplicate n-grams. We propose a solution to this problem that uses an external sort with suffix-array in-memory sorting and temporary file
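The general idea of finding duplicate n-grams with an external sort can be sketched as follows. This is only a minimal illustration under stated assumptions, not the authors' implementation: it assumes word-level n-grams, treats each in-memory chunk as a list of whitespace-free tokens, and calls an n-gram "duplicate" if it occurs more than once anywhere in the collection. The function names (`sorted_ngrams`, `write_run`, `read_run`, `duplicate_ngrams`) are hypothetical. Each chunk is sorted in memory by ordering n-gram start positions (a suffix-array-style sort), written to a temporary file as a sorted run, and the runs are then merged so that equal n-grams become adjacent.

```python
import heapq
import os
import tempfile
from itertools import groupby


def sorted_ngrams(words, n):
    """Sort one in-memory chunk suffix-array style: order the n-gram
    start positions by the n words that begin at each position."""
    positions = range(len(words) - n + 1)
    order = sorted(positions, key=lambda i: tuple(words[i:i + n]))
    return [tuple(words[i:i + n]) for i in order]


def write_run(ngrams):
    """Write one sorted run to a temporary file and return its path.
    Assumes tokens contain no whitespace (e.g. produced by str.split())."""
    with tempfile.NamedTemporaryFile(
        "w", delete=False, suffix=".run", encoding="utf-8"
    ) as f:
        f.writelines(" ".join(g) + "\n" for g in ngrams)
        return f.name


def read_run(path):
    """Stream the n-grams of a sorted run back as tuples of words."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield tuple(line.rstrip("\n").split(" "))


def duplicate_ngrams(chunks, n):
    """Return the set of n-grams that occur more than once in total."""
    run_paths = [write_run(sorted_ngrams(chunk, n)) for chunk in chunks]
    try:
        # Merging sorted runs places equal n-grams next to each other,
        # so duplicates can be found in a single sequential pass.
        merged = heapq.merge(*(read_run(p) for p in run_paths))
        return {g for g, grp in groupby(merged) if sum(1 for _ in grp) > 1}
    finally:
        for p in run_paths:
            os.remove(p)
```

For example, calling `duplicate_ngrams` with the two chunks `"the quick brown fox".split()` and `"a quick brown dog".split()` and `n = 2` yields only the shared 2-gram `("quick", "brown")`. Only one chunk has to fit in memory at a time; the merge phase reads the temporary files sequentially.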
Similar resources
Accurate discovery of co-derivative documents via duplicate text detection
Documents are co-derivative if they share content: for two documents to be co-derived, some portion of one must be derived from the other, or some portion of both must be derived from a third document. An existing technique for concurrently detecting all co-derivatives in a collection is document fingerprinting, which matches documents based on the hash values of selected document subsequences,...
Methods for Identifying Versioned and Plagiarised Documents
The widespread use of online publishing of text promotes storage of multiple versions of documents and mirroring of documents in multiple locations, and greatly simplifies the task of plagiarising the work of others. We evaluate two families of methods for searching a collection to find documents that are co-derivative, that is, are versions or plagiarisms of each other. The first, the ranking ...
A Scalable System for Identifying Co-derivative Documents
Documents are co-derivative if they share content: for two documents to be co-derived, some portion of one must be derived from the other or some portion of both must be derived from a third document. The current technique for concurrently detecting all co-derivatives in a collection is document fingerprinting, which matches documents based on the hash values of selected document subsequences, ...
Passage Selection To Improve Question Answering
Open-domain Question Answering (QA) systems perform the task of detecting text fragments in a collection of documents that contain the response to users' queries. These systems use high-complexity tools, which restricts their applicability to small amounts of text. Consequently, when working on large document collections, QA systems apply Information Retrieval (IR) techniques to redu...
Detecting Short Passages of Similar Text in Large Document Collections
This paper presents a statistical method for fingerprinting text. In a large collection of independently written documents, each text is associated with a fingerprint that should be different from all the others. If two fingerprints are too close, it is suspected that passages of copied or similar text occur in the two documents. Our method exploits the characteristic distribution of word trigrams,...
Publication date: 2008